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Abstract. The statistical analysis of measurement data has become a key component of many 
quantum engineering experiments. As standard full state tomography becomes unfeasible for 
large dimensional quantum systems, one needs to exploit prior information and the "sparsity" 
properties of the experimental state in order to reduce the dimensionality of the estimation 
problem. In this paper we propose model selection as a general principle for finding the 
simplest, or most parsimonious explanation of the data, by fitting different models and 
choosing the estimator with the best trade-off between likelihood fit and model complexity. We 
apply two well established model selection methods - the Akaike information criterion (AIC) 
and the Bayesian information criterion (BIC) - to models consising of states of fixed rank and 
datasets such as are currently produced in multiple ions experiments. We test the performance 
of AIC and BIC on randomly chosen low rank states of 4 ions, and study the dependence of the 
selected rank with the number of measurement repetitions for one ion states. We then apply 
the methods to real data from a 4 ions experiment aimed at creating a Smolin state of rank 4. 
The two methods indicate that the optimal model for describing the data lies between ranks 6 
and 9, and the Pearson x test is applied to validate this conclusion. Additionally we find that 
the mean square error of the maximum likelihood estimator for pure states is close to that of 
the optimal over all possible measurements. 
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1. Introduction 



Recent years have witnessed significant progress in the engineering and control of quantum 
systems [1, 2, 3]. From the preparation of exotic quantum states [4, 5, 6, 7] to the 
implementation of accurate quantum protocols [8, 9, 10, 11] experimentalists are confronted 
with the problem of reconstructing such mathematical objects statistically, from the outcomes 
of repeated measurements. The theoretical and experimental challenges have stimulated 
the development of a large array of new methods at the boundary between quantum theory 
and statistics: state estimation (or tomography) [12, 13, 14, 15, 16, 17], tomography for 
incomplete data [18, 19, 20] design of experiments [21, 22, 23], quantum process and detector 
tomography [24, 25] construction of confidence regions (error bars) [26, 27], quantum 
tests [28, 29] entanglement estimation [30], quantum homodyne tomography [31, 32, 33], 
asymptotic theory [34, 35, 36]; see also the monographs [37, 38] and the collections of papers 
[39,40]. 
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The importance and difficulty of quantum state tomography became evident in the landmark 
experiment [7] where entangled states of up to 8 ions were created and fully characterised. 
More recently the same group succeeded in creating entangled states of 14 ions [41] but 
their statistical reconstruction is beyond current computational capabilities! Therefore, there 
is great interest in alternative methods aimed at reducing the dimensionality of the state 
estimation problem without making unwarranted or unrealistic assumptions. Among these 
we mention the development of quantum compressed sensing methods [42, 43] which extend 
the "classical" ^-minimisation algorithms [44, 45] to the quantum set-up, and the estimation 
of many-body states based on lower dimensional families of matrix product states [46]. Both 
methods rely on the Ansatz that the states produced in real experiments are not completely 
arbitrary, but have some sparsity structure that can be exploited for more efficient estimation, 
e.g. low rank in the first case and finite correlations in the second. 

In this paper we propose and investigate a state tomography method which can also take 
advantage of the sparsity structure of the state, by adjusting the rank of the estimator according 
to the measurement data. However, although it shares with compressed sensing the goal 
of exploiting sparsity structures, our method is closer to the standard tomography set-up 
in the sense that it takes as input the dataset consisting of measurement counts rather than 
estimates of observables expectations, and it uses maximum likelihood for determining the 
estimator of a given rank. The philosophy of rank-based model selection is to choose an 
estimator which offers a good fit to the data, but in the same time contains a minimal 
number of parameters (Occam razor principle). For this, we construct a sequence of models 
consisting of states of fixed rank, and choose the model whose maximum likelihood estimator 
achieves the best trade-off between fit (likelihood) and model complexity. To quantify the 
trade-off we use two model selection methods, the Akaike information criterion (AIC) [47] 
and the Bayesian information criterion (BIC) [48] which have been used extensively in 
model selection problems; see [49, 50] for an introduction to model selection methods, and 
[51, 52, 53] for applications in quantum statistics. 

Although the method can be used for an arbitrary measurement set-up, we focus on the 
statistical model of multiple ions tomography (MIT) [7, 41], which constitutes a physically 
relevant testing ground for tomography of large dimensional systems. We emphasise that 
model selection does not assume any particular model, but rather lets the data select the model 
which gives the most suitable description. This offers the experimentalist an "honest" but also 
minimal estimation framework. The states created in many experiments have a good degree of 
purity, and therefore one would only need to compute the maximum likelihood over spaces of 
low rank, rather than full rank matrices. Furthermore, the principle of model selection can be 
applied to other families of models such as matrix product states, which may be more suitable 
in specific experimental conditions. 

The paper is organised as follows. In section 2 we introduce the statistical model of MIT, 
and discuss some of the existing estimation methods. To gain more insight, in section 3 
we investigate the problem of estimating pure states and in particular we find that the MIT 
measurement set-up is quasi-optimal in the sense that the mean norm-two distance squared 
is only slightly larger than that of the (asymptotically) optimal measurement. Section 4 
introduces the two rank-based model selection procedures based on the AIC and BIC, and 
discusses the implementation of the fixed rank models by using the Cholesky decomposition. 
The methods are applied to three randomly chosen states of ranks 1, 2 and 3 in section 5. We 
find that both criteria perform very well when the strictly positive eigenvalues are significant 
relative to the number of measurement samples, and explain this by analysing the asymptotics 
of the log-likelihood ratio statistic for different models. In section 6 we investigate the 
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dependence of the selected rank on purity and number of measurement repetitions in a one 
ion study. In section 7 we apply BIC and AIC to experimental data provided by Rainer Blatt's 
group from the University of Innsbruck. We find that the maximum log-likelihood flattens 
from rank 10 (see Figure 5), and the AIC and BIC predict ranks 9 and respectively 6, which 
capture the 4 significant eigenvalues and some of the eigenvalues of order 1CP 2 (see Table 
3). As additional check, we use the Pearson x 2 statistic to test the hypothesis H that state 
has rank at most rank 10, and find that there is no evidence to reject H - Section 8 contains a 
summary of the paper and an outlook for future work. 

2. Background on multiple ions tomography 

In this section we review the statistical model describing the measurement data collected in 
multiple ions tomography (MIT) experiments [7, 41, 11], and comment briefly on the existing 
estimation methods, with an emphasis on maximum likelihood estimation. 

The physical system consists of an array of trapped ions whose joint state can be manipulated 
by means of precisely tuned laser pulses. Since only two electronic energy levels are used for 
encoding the state, each ion can be describe mathematically as a two level system, so that the 
joint Hilbert space of k ions is C 2 . The state of the system is described by a density matrix 
p on this space, i.e. a 2 fe x 2 k complex selfadjoint matrix which is positive semidefinite and 
has trace one. Typically, the goal of the experiment is demonstrate the preparation of a certain 
target state to a sufficiently high degree of precision. To validate the result, a large number of 
preparation-measurement cycles are performed, and the collected measurement data are used 
to estimate the state produced in the preparation phase. 

In a nutshell, the measurement procedure consists of performing simultaneous Pauli 
measurements on all ions, each combination of Pauli observables being repeatedly measured 
n times. More precisely, each measurement is defined by a setting d which specifies which of 
the 3 Pauli observables <r x , <t v ,<t z is measured for each ion. For instance d := (x, y, z, z) is 
a 4 ions measurement setting, and in general for a fc-ions state there are 3 fe possible settings 
d e T>k := {x, y, z} k . For each fixed setting, the measurement produces random outcomes 
seO/ £ :={+l,-l} fe with probability distribution 

P„(s|d) := Tr(pP s d ) = ( e d |p|e d >, (1) 

where P d are one dimensional projections onto the vectors of the orthonormal basis 

|e?>:=|e£>®-"®|e*J>, s e O k := {+1, (2) 

formed by taking tensor products of eigenvectors of the Pauli matrices , ■ ■ ■ , &d k '■ 

a d \e d s ) =s|ef), de{x,y,z}, s e {+1,-1}- 

After repeating n times the measurement with setting d, the data can be summarised by 
counting the number of times that each possible outcome has occurred. The probability of 
a certain set of counts {7V(s|d) : s e Ok} is given by the multinomial distribution with 
probabilites given by (1), so that 

F p ({N(s\d) : s G O k }) = ^ 8 | d)| II P ^ 8 l d ) Jy( " d) ' d G Vk - (3) 
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Since any given setting d gives information only about the diagonal of the density matrix 
p with respect to the basis (2), the above procedure is repeated for all possible settings to 
obtain the complete 2 fe • 3 fe dataset consisting of counts {7V(s|d) : (s, d) e Ok x T> k } for all 
outcomes in each setting. As successive preparation-measurement cycles are independent of 
each other, the distribution over all possible datasets is the product of multinomials 

P p ({iV(s|d) : (s, d)eO k x V k }) = J]P p ({AT(s|d) : s e O k }). (4) 

d 

Let us ponder for a moment on the structure of this statistical model. If no assumption is made 
on the state, the parameter space is the (4 fe — 1) -dimensional convex set of density matrices 
Sk C M(C 2k ). We will verify that the above measurement scheme is informationally 
complete, or equivalently that the parameter p is identifiable in the sense that there is a one- 
to-one correspondence between p and the probability distribution P p given in (4). Since 
{a x , <j v , a z , cto := 1} form a basis in the space of 2 x 2 selfadjoint matrices, the tensor 
products 

<7; := <g> •• • <g> a ik , i := ...,i k )e {x,y,z,0} k 

form an orthonormal basis of the space of 2 fe x 2 k selfadjoint matrices with respect to the 
inner product (A, B) := Tr(AB). Therefore, any state can be expanded as 

p = ^2p i a i :=^2(a i ,p)a i , (5) 

i i 

and to estimate p it suffices to estimate the Fourier coeffcients p\. A naive unbiased estimator 
can be easily constructed based on the counts of any particular measurement setting d for 
which dj = ij whenever ij ^ 0. For example when k = 2, to estimate P( x ,z) we consider the 
counts from the setting d = (x, z), and define 

P(x,z) ■■= -i- +l)|d) + N((-l, -l)|d) - N((+l, -l)|d) - N((-l, +l)|d)] . 

V Z n 

While this proves that the state can be fully estimated, the naive estimator is generally not 
a bona-fide density matrix, and more importantly, has large estimation errors. The latter 
is due to the fact that p\ is constructed from the counts of a single setting and does not 
harness the information contained in the counts of the others. Indeed, since the projectors 
{P s d : s <E Ofe,d e V k } form a (highly) overcomplete set of vectors in M(C 2 ), any 
product of Pauli's a\ can be expressed in (continuously) many ways as a linear combination of 
projectors, each producing a linear estimator which could in principle be combined to obtain 
a significantly reduced MSE. However, finding the "optimal linear estimator" is problematic 
due to the fact that the empirical frequencies /n are noisy estimates of the probabilities 
P(s|d), and their covariance depends on the unknown state. An interesting proposal in 
this direction is the Kalman filter estimator developed in [14], but to our knowledge its 
performance in the case of MIT has not been extensively investigated. Another proposal 
put forward in [54] is to combine the naive estimator with a second stage rank-penalised 
minimisation of the norm-two square (Hilbert-Schmidt) distance to the final estimator. 

Maximum likelihood (ML) is one of the most commonly used estimation methods across 
statistics. Its popularity is due to the intuitive interpretation, versatility, and strong theoretical 
underpinning. Under certain regularity conditions the ML estimator is asymptotically optimal 
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(or efficient in statistical terminology) in the sense that its covariance achieves the Cramer- 
Rao bound in the limit of large samples, and has Normal (Gaussian) limiting distribution, 
with covariance equal to the inverse of the Fisher information matrix [55]. By discarding 
the constant factorial term in (3) and taking logarithm we can write the maximum likelihood 
estimator for MIT as 



where the maximum is computed over the set S k of fc-ions states r. Note that the ML estimator 
is invariant under reparametrisation, i.e. the ML estimator of a state functional / := f(p) is 
f(p). The ML estimator has been used extensively in quantum statistics [56], and an efficient 
iterative computational routine has been put forward in [57, 58]. Nevertheless, ML has been 
criticised for several perceived drawbacks [12, 13]. The first criticism is that the ML has the 
tendency to produce rank deficient estimators, when the true state has some small eigenvalues; 
this can be understood [12] by observing that the likelihood (seen as a function of the matrix 
elements) may attain its maximum at a point which lies outside the convex space of states 
iSfc, in which case the "constrained" MLE p will fall on boundary of Sk by the concavity of 
the log-likelihood function. The second, and in our opinion more serious criticism is that 
the asymptotic theory does not apply as such, to states which lie on the boundary. As a side 
remark, we note that the asymptotic theory does hold for the unconstrained ML estimator, 
for "generic" states which satisfy P p (s|d) > for all s, d. This may be used to prove 
the existence of the asymptotic distribution of the MLE (6), but the latter is likely to be 
complicated and impractical for establishing confidence regions (error bars). In this paper 
we focus on the performance of the proposed model selection estimation method, and refer 
to [26, 27] for two recent proposals for constructing confidence regions, and the forthcoming 
paper [59] for a comparative study of bootstrap and Fisher information methods. 

3. Estimation of pure states in the MIT setting 

Pure states are arguably the golden standard of most state preparation experiments [7, 41]. 
Therefore, we will start by considering the ideal situation in which the quantum state is 
assumed to be pure, and we would like to estimate it using the MIT dataset described in 
section 2. The goal is to get more insight into the statistical power of the measurement set- 
up and its asymptotic properties. This will prepare the ground for the next section where 
the purity assumption is lifted and the state is fitted to models of increasing rank, and model 
selection criteria are used for choosing a rank with a good trade-off between fit and model 
complexity. The findings are summarised at the end of the section, where we also clarify the 
relation to the compressed sensing set-up [42, 43] which uses as input the estimated values pi 
of the expectations of Pauli observables g\. 

The pure states ML estimator p can be computed as in (6), with the maximisation restricted 
to the space of pure (rank one) states on C . As figure of merit we consider the mean square 
error (MSE) 




s.d 



(6) 




Pill) 




(7) 
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where p\ are the Fourier coefficients with respect to the Pauli basis defined in (5). Note that 
for pure states the norm-two distance is related in a simple way to the (arguably more natural) 
norm-one distance ||p — p\\ := Tr(\p — p\) by the equality \\p — p\\ 2 = \\p — p\\\/\/2. 

We would like to address the following questions: 

1 . What is the MSE of the MLE ? 

2. Are we in an "asymptotic regime" ? 

3. Is the multiple ions measurement "optimal" in any sense ? 

The pure states form a compact manifold of dimension 2(2 fe — 1) which can be identified with 
the complex projective space CP 2fc+1 . Therefore, when restricting the MIT statistical model 
to pure states, the standard asymptotic efficiency theory [55] is applicable. For simplicity, we 
assume that \ tp) has the expansion with respect to the standard basis \tp) = ^ Ci\e\) such that 
ci 7^ 0, in which case we can parametrise the state by the real and the imaginary parts of the 
remaining coefficients 

9 -+ \i, 6 ) = y/\-\W\e x ) + + ie 2k _ 2+j )\ej), 9 e M 2 ^ 1 ', \\0\\ < 1. 

Note that due to the geometry of the projective space, any global parametrisation must be 
singular unless some points are cut out as we did here. However, as we are interested in 
the asymptotic behaviour of the ML estimator, the global properties are unimportant and we 
can always choose an appropriate local parametrisation for all practical purposes. The norm 
two-square distance (7) can be rewritten locally as a quadratic form 

Hasp -P 6 \\i = 0- efG(e){e of + (||<? - e|| 2 ), (8) 

where G(9) is a positive definite matrix whose explicit form can be easily computed. The 
maximum likelihood estimator 8 = 6 n is efficient , i.e. as n —¥ oo 

Vn(9-6)^N(0,I(9)- 1 ), (9) 

where N(0, 1(6)^ 1 )) is a the centered normal distribution with covariance matrix 
which is the inverse of the (classical) Fisher information matrix 1(9). In particular, from (8) 
and (9) we get 

lim E(||pfl - p s \\l) = Tr(G(6)I-\6)). (10) 

n— >oo 

To verify these results we simulated 100 datasets from a fixed but randomly chosen pure state 
of k = 4 ions, each dataset consisting of counts for n = 100 measurements per setting. 
Figure 1 shows the histogram of the square error \\p$ — || | whose empirically estimated 
MSE (green line) is very close to the asymptotic prediction (blue line) computed from (10) 
which is equal to 3.9 • 10~ 3 . More interestingly, we find that the MSE is also very close to 
the "quantum optimal" bound (red line) which describes the best MSE achievable with any 
measurement! The latter is given by the simple formula ( see [60] and references therein) 

™„c^ ((parameters 2(2 k - 1) 2(2 4 - 1) 

QMSE = — = \ = . 4 in ./ =3.7-10- 3 . (11) 

jjsamples 3 fc • n 3 4 • 100 
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Figure 1. Histogram of the norm-two error \\p — of the MLE p for 100 samples from a 
fixed pure state p. The mean square error (green line) is very close to the classical Cramer-Rao 
bound (blue line) as predicted by asymptotic theory, and the latter is only slightly larger than 
the quantum Cramer-Rao bound (red line), showing that for pure states the ions measurement 
is almost almost optimal among all measurements. 



This example shows that the MSE of the ML estimator agrees with the asymptotic theory 
when n = 100 and k = 4; we expect that the same holds for fixed n and larger k 
due to the favourable scaling of the number of samples n ■ 3 fe with respect to the number 
of parameters 2(2 k — 1). Moreover, the multiple ions measurement set-up appears to be 
quasi-optimal. This implies that adaptive strategies for choosing the settings cannot offer 
a significant improvement, but does not exclude the possibility that a similar performance 
can be achieved with a fraction of the settings. To further emphasise the point that the MIT 
dataset is very informative, and that ML can optimally extract this information from the data, 
we compare the above results with the MSE of the naive estimator discussed in section 2, and 
with the asymptotic MSE of a dataset consisting of estimates of the Pauli products obtained 
by lumping together the counts of each measurement setting into a single statistic. 

MSE of the naive estimator. With the square error defined as in (7), we note that the MSE 
of each coefficient p\ for which i\, . , . ,ik ^ 0, is of the order l/(n ■ 2 k ) since we are 
essentially dealing with the problem of estimating the mean of a random variable with values 
{+2~ k / 2 , —2~ k / 2 }. Therefore these coefficients alone (not counting those for which some 
ij are zero) bring a contribution of the order 3 fc / (n ■ 2 k ) which is larger than QMSE (1 1) by 
a factor (9/4) fc /2. For the particular example of k = 4 and n = 100 this gives an MSE of 
5 ■ 10~ 2 which is an order of magnitude larger than that of the MLE. 

MSE of the coarse grained data. At this point, it is natural to ask the following question. 
Suppose that we are given the 3 fc empirical averages of the Pauli products a- Y 

p t « Tr(pcf-i) = (ip\ai\ip), ii,...i k ^Q (12) 

which are obtained by computing one empirical average for each column of the original 
dataset. Is there a more efficient method to estimate the pure state \ip), from the data (12) 
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and what is its MSE ? Two important candidates are the compressed sensing and lasso 
algorithms [43] (with the slight difference that they would use a smaller number of settings, 
but proportionally more measurements per setting). Both methods aims at estimating the state 
by trying to match the empirical expectations p; with those of a selfajoint matrix, while in the 
same time penalising the trace norm of the matrix. Testing these methods is beyond the scope 
of this paper, but the asymptotic efficiency theory offers a shortcut to the answer of the above 
question. Applying the same methodology as before, but to the coarse grained data (12) we 
can predict that (asymptotically) the MSE of any estimator is bounded from below by that of 
the ML estimator p cg which in turn satisfies 

lim nE (||p cg - p\\l) = Tr(G(9)I cg (8)^ 1 ) (13) 

where the only difference with (11) is the Fisher information matrix which satisfies the 
inequality I cg {fi) < Figure 2 shows histograms of the asymptotic MSE (11) for the 

full MIT data (left panel) versus the MSE (13) of the coarse grained data (right panel). The 
histograms were produced with 250 randomly chosen pure states, k = 4 and n = 100. Note 
that the MSE of the coarse grained data is smaller than the (partial) estimated contribution of 
the naive estimator. However, the MSE is still an order of magnitude higher that that of the 
full dataset, due to the fact that a significant amount of information has been discarded in the 
process of retaining the Pauli products expectations. 




0.00388 0.00392 0.00396 0.03 0.04 0.05 0.06 0.07 0.08 

MSE MSE 



Figure 2. Histograms of asymptotic MSE's for 250 randomly chosen pure states, with k = 4 
and n = 100. Left panel: full counts dataset. Right panel: coarse grained dataset. Keeping 
only the empirical means of the Pauli products leads to a 10 fold increase in the MSE. 

To summarise, we conclude that MIT works because the different settings "overlap" with each 
other in the sense that the one dimensional projections |e|) (e|| form an overcomplete set of 
size 2 k x 3 k which is significantly larger than the dimension of the space of matrices 4 fe even 
for small k. Therefore, the measurement data is structured so that the counts for each setting 
provide a relatively small amount of information, but the dataset as a whole is very informative 
about the state. Reducing this dataset to a small number of expectations may be advantageous 
for the purpose of devising fast estimation algorithms, but underperforms from the viewpoint 
of statistical errors, for a given number of state re-preparations. This statement may seem 
to contradict the simulation results illustrated in Figure 1 of [43], where compressed sensing 
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and lasso are found to perform better than ML on datasets of the type (12). This apparent 
contradiction is lifted by the following observations: 

1) Our comparison is between the MSE of efficient estimators for two different types of 
data. Based on this we conclude that any estimator using the coarse grained data will 
asymptotically underperform ML based on the full counts experimental dataset. 

2) The comparison in [43] is different; it regards the performance of ML versus compressed 
sensing and lasso for the coarse grained data. Since a completely unknown state is not 
identifiable for the coarse grained model, the MLE is not consistent, and arguably should 
not be used in this case. 

4. Model selection for quantum tomography 

In the previous sections we discussed the extreme scenarios of "full" quantum tomography 
and estimation of pure states. In reality, the states produced in experiments tend to have one 
or few significant eigenvalues and a large number of small eigenvalues of different orders 
of magnitude, which account for the imperfections in the preparation procedure. Therefore, 
neither of the two settings seems to be suitable: the former underfits the real state while the 
latter overfits by trying to estimate eigenvalues that may not be statistically significant. 

This is a well known phenomenon in statistics, that occurrs in high (or infinite) dimensional 
problems such as estimating the probability density of a real valued random variable, or in 
non-linear regression where an unknown function is "learned" from estimates of its values 
at certain points. In such cases the maximum likelihood estimator overfits the data and is 
very "noisy". A possible solution is to use a penalised maximum likelihood estimator, which 
maximises the difference between the log-likelihood and a penalty measuring the complexity 
of the estimator, e.g. the number of non-zero coefficients with respect to an appropriate 
basis. More generally, one can design a class of statistical models with various degrees of 
complexity, and decide which model and which estimator from that model is most suitable for 
describing the data. Our aim is to apply the model selection methodology to state tomography, 
the models being the families of states of a given rank. The same methods can be used in tasks 
such as quantum homodyne tomography [31] where the state to be estimated is that of a light 
pulse (one mode continuous variables system), and a model could be the set of states with a 
given maximum number of photons [61]. 

To select the rank of the state we will use two well established methods: the Akaike 
information criterion (AIC) [47] and the Bayesian information criterion (BIC) [48]. Both 
methods amount to penalising the log-likelohood function by a factor proportional to the 
dimension of the model, and choosing the ML estimator with the smallest value of the 
information criterion. In the next section we give a brief general description of AIC and 
BIC, after which we discuss the parametrisation of the fixed rank quantum models, and the 
implementation of the model selection procedure. 

4.1. AIC versus BIC model selection 

Occam's razor is an old scientific principle which states that when trying to explain a 
phenomenon, one should choose the simplest model that adequately fits the data. A very 
complex model will be able to fit the given data almost perfectly but it will not be able to 
generalise very well. On the other hand, very simple models will not be able to describe the 
essential features of the data. Therefore, we must make a compromise and choose a model 
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which is as simple as possible, but no simpler. It is not surprising that many approaches have 
been proposed over the years for dealing with this key aspect in statistical modelling. 

The general framework of model selection is the following. We are given n samples 
X = {Xi, . . . , X n } from some unknown distribution P which we try to fit with a distribution 
from one of several possible statistical models 

Mr := {Pe r : r e 9 r C R p(r) }, r = l,...,D, 

where A4 r has a parameter r of dimension p(r). For simplicity we assume that X; e 
{1, . . . , a} are discrete random variables, as in the case of MIT measurements. We also 
assume that at least one of the D models contains the true distribution, or at least gives a 
reasonabale approximation to it. We denote by 9 r the ML estimator for the model M r , and 
by te r = 4> r ( x ) := Ya=\ log¥e r (Xi) log-likelihood function at 6 r . 

4.1.1. Akaike Information Criterion (AIC) The AIC for model M r is [47] 

AIC(r) = -2.^ p +2p(r), 

and the chosen model is the one with the minimum AIC. Since p(r) is larger for more complex 
models, the AIC formally biases against overly complicated models. Although the derivation 
of AIC is outside the scope of this paper, we briefly explain the idea behind the choice of 
penalty. Having computed the ML estimators for different models we would like to select the 
"best" one in the sense that the corresponding distribution Fg is the closest to the "truth " P 
with respect to the Kullback-Leibler distance (or relative entropy) 

a a 

K(F\¥ §r ) := £>(i)log(P(i)) - £>(i)log(P flr (i)). 

i=l i=l 

However, this quantity cannot be computed since P is unknown. Since the first terms on the 
right side is the same for all models, it can be neglected, and one can focus on estimating the 
second term, which nevertheless still depends on P. If instead of 8 r we had a fixed parameter 
9 r , this term would be the expected value of the log-likelihood at 9 r and could be estimated 
by £$ r (X.)/n, by the law of large numbers. However l~ e (X)/n is a biased estimator of the 

second term, due to the fact that the data has been already used in computing 9 r . Akaike 
showed that under the regularity conditions required by the asymptotic normality theory, the 
bias is approximately p(r)/n, so that ML which is the closest to the truth is approximately 
give by the minimizer of the AIC. 

4.7.2. Bayesian Information Criterion (BIC) The BIC for model M r is defined as [48] 

BIC(r) = -2 • i §r + p(r) log (n) 

where n is the sample size. Note that the BIC differs from the AIC only in the second term 
which increases with n, so that BIC favors simpler models (that is models with a smaller 
number of parameters) compared to AIC. But despite the superficial similarity between the 
AIC and BIC the latter is derived in a very different way, within a Bayesian framework. 

For simplicity, suppose that there are two competing models, M.\ and M.^ with parameters 
9\ and 9i respectively. One begins by assigning prior probabilities q\ and q 2 = 1 — q\ to 
the event that the observed data have been generated from either model. One also assigns 
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prior distributions 7Ti(#i) and ^2(^2) to the model parameters in each model. Then one can 
compute the marginal likelihoods which can be interpreted the probability of observing the 
data if model Mi is correct, having integrated out our ignorance about the parameters Q\ and 
9 2 in each model. Hence, one can apply Bayes theorem to evaluate the probability of model 
A4i being the true model given the observed data. A measure of the extent to which the data 
support model Mi over M\ is given by the posterior odds 

P(A4 2 |X) = P(X|M 2 ) q 2 
P(Mi|X) ~ PX|Mi) qi 

The first fraction on the right-hand side is called the Bayes factor and the second is known 
as the prior odds. The Bayes factor is a fundamental quantity in Bayesian theory and can be 
interpreted as a measure of the extent to which the data support model Mi over Mi when 
the prior odds are equal to one. The difference BIC(1) — BIC{2) can be shown to be a large 
sample approximation to the logarithm of the Bayes factor, so that the second model is chosen 
if the difference is positive. 

4.2. Parametrising models with fixed rank 

Here we describe the fixed rank models which will be used in model selection. Let T>(d, r) 
be the set of rank r states of a d-dimensional quantum system, i.e. those states which have 
exactly r non-zero eigenvalues, and let 

r 

TZ(d,r) := \JV(d,r) 

i=i 

be the set of states of rank at most r. Every state p has a unique spectral decomposition 

r 

p = A;P 
1=1 

where > are its distinct eigenvalues, and Pi is an eigenprojector whose dimension is 
equal to the multiplicity rrii of Xi. The spectral information (Ai, Pi, . . . , A r , P r ) can be used 
to construct a parametrisation of V(d,r) and lZ(d,r), which has the advantage of a direct 
physical interpretation. However, the practical implementation of such a parametrisation for 
computing the maximum likelihood estimator is less straightforward due to the orthogonality 
constraints for the eigenvectors, and the singularities appearing on lower dimensional 
manifolds consisting of states with non-trivial sets of multiplicities. A variation on this 
would be to parametrise the state by the set of eigenvalues and an eigenbasis, in which case 
the singularity problem is replaced by the non-identifiability of the different basis vectors 
corresponding to the same eigenvalue. 

We will describe an alternative parametrisation which is related to the Cholesky factorisation 
of the state. Recall that any positive definite matrix A e M(C d ) has a unique decomposition 

A = T*T (14) 

where T is an upper triangular matrix with strictly positive diagonal elements. Therefore there 
exists a one-to-one correspondence between full-rank states p and matrices T as described 
above, with the additional constraint 



Tr(T*T)=Y,\T ij \ 2 =Tr(p) = 1. 



(15) 
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We parametrise such a matrix T by the vector of real numbers 9 := (R, I, D) e R d 1 with 

R := (Re(T 12 ),...,Re(T d _ M )) 

I := (Im(T 12 ),...,Im(T d _ M )) (16) 
D := {T 22 , . . . ,T dd ) 

such that R, I are the real and imaginary parts of the off-diagonal elements ordered from the 
first to the d — 1 row, and from left to right along each row. By (14) and (15), 9 must satisfy 
the constraints D > and ||i?|| 2 + ||/|| 2 + ||D|| 2 < 1, and the left-top element of T is equal 
to 

Tn = Tn (5) = (1 - \\R\\ 2 + ||/|| 2 + ll^ll 2 ) 172 > 0. 

The Cholesky parametrisation of the full rank matrices can be extended, albeit with some 
caveats, to the spaces of rank-deficient matrices T>(d, r) and lZ(d, r). The idea is to consider 
a decomposition as in (14), but with T belonging to the set T + (d,r) of d x d upper 
triangular matrices with the bottom d — r rows equal to zero, and satisying T\\ , . . . , T rr > 0; 
equivalently, one can consider r x d trapezoidal matrices obtained by removing the zero 
lines of the triagular matrices. Since every T e T + (d,r) is of rank r, this guarantees that 
the corresponding state p has the same property. However, not all states of rank r can be 
decomposed in this way! Indeed it is easy to verify that if p = T*T then the r x r top-left 
principal minor of p must be of rank k, and therefore such a parametrisation excludes states in 
T>(d, r) which do not satisfy this property. Nevertheless, "generic" matrices of rank r do have 
principal minors of rank r, in the sense that those with smaller rank principal minor form a 
lower dimensional subset of T>(d, r). If restrict our attention to the subset V(d, r) + C T>(d, r) 
which excludes the "deficient" states, we find that the Cholesky decomposition exists and is 
unique, so that 

V(d, r)+ := {p = T*T : T e T d +} C V(d, r). 

What can we say about the complement T>(d, r) \ T>(d, r) + ? In order to have a Cholesky 
decomposition we need to relax the condition Tn, . . . , T rr > and consider the set T(d, r) 
of r-lines upper triangular matrices, with non-negative elements on the diagonal. In this case, 
the root T not only exists but is in general not unique. 

Let &(d, r) + be the set of real parameters 9 := (R, I, D) of a matrix T = Tg € T(d, r) + 
which are defined similarly to equation (16), and let Q(d,r) be the set of parameters 
associated to matrices in T>(k, r). We define two sequences of quantum statistical models: 

Q + (d,r):={p e =T 6 *T e :9ee(d,r) + }, r = l,...,d (17) 

Q(d,r) :={ Pe =T* e Tg : 9 eS(d,r)}, r = l,...,d (18) 

the first one consisting of rank r matrices with rank r principal minor, the second one 
describing (albeit not always uniquely) all matrices of rank up to r. The reason why we 
mention the two models is that each has some appealing features and some disadvantages. 
For Q(d, r) the advantage is that we deal with a nested set of models 

Q(d,l) C Q(d,2) c ••• C Q(d,d). 

The disadvantage is that the Cholesky parametrisation is not one-to-one in this case. On the 
other hand, Q(r, d) + offers a one-to-one parametrisation of rank r matrices in T>(d, r) + , with 
the disadvantage that the models are not nested, but instead Q(d, r) + lies on the boundary of 
Q(d, r + 1) + . While these facts are relevant to a theoretical analysis, for practical purposes 
the distinction between the two models is less important, and in all our numerical experiments 
we used the models Q + (d, r). 
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4.3. The implementation ofAIC and BIC model selection for rank-based models 

We return now to the state estimation problem, and describe how AIC and BIC model selection 
is applied to the family of rank-based models described above for a system consisting of k 
ions, i.e. d = 2 k . Let 

i 6 = £e ({N(s\d) :seO k ,de V k }) := ]T JV(s|d) logP pe (s|d) 

s,d 

be the log-likelihood of the measurement data, ignoring the constant factorial terms. The 
maximum likelihood estimators 9 r and p r for the model Q(2 k , r) + are : 

8 r := arg max £$, p r := oa . 

eee(2 fc ,r)+ 

In order to choose between the different models we compute the AIC and the BIC for each 
rank and select the model with the smallest value. In our case the two criteria are given by 

AIC(r) :=-2L+ 2p(2 k ,r) 

(!9) 

BIC(r) :=-2£ §r + p(2 k , r) log(n • 3 fc ) , 
with 

p(d,r) = 2 ■ d-r- r 2 - 1 (20) 

the dimension of the space of rank r matrices, and n ■ 3 k is the total number of measurements. 
In practice each criterion decreases with the rank until it reaches the minimum value after 
which it increases, so one only needs to compute the ML estimator up to the rank where 
the criterion begins to increase. For low rank states, this offers a the advantage of having to 
compute the maximum likelihood estimator on models of dimension approximately r ■ d rather 
than d 2 — 1 as standard ML. The disadvantage is that the likelihood function is not concave 
as in the full rank model, and may have several local maxima. 

To implement the ML estimation numerically, we used a standard maximisation routine of the 
statistics package R. Additionally, we developed an array of statistical analysis tools such as 
Fisher information, square errors, bootstrap, Pearson x 2 statistic which will be made available 
online. Although the computation of the log-likelihood was optimised for faster speed, the 
maximisation can probably be improved significantly by using more sophisticated routines. 

In the next sections we will discuss the results of several investigations on the performance of 
BIC and AIC model selection, using simulated and real data. 



5. Study 1: randomly chosen low rank states 

In a first simulation study we chose 3 "random" states of ranks 1, 2, and 3 of k = 4 ions, and 
generated 100 datasets from each state, each dataset with n = 100 measurement repetitions. 
We then computed the maximum likelihood estimators for the ranks between 1 and 4 and 
used AIC and BIC to select the optimal rank. The exact procedure used to generate "random" 
states is not very important, but it will be relevant that all non-zero eigenvalues of the states 
are significant. As illustrated in Table 1, BIC selected the correct rank for each state in roughly 
90% of the cases while for AIC the rate is about 80 %. Due to the different penalties, the AIC 
tends to over-estimate the rank of the state, while BIC has a slight tendency to under-restimate 
it. While at first sight this may appear to be a surprisingly good performance, we will show 
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AIC rank 




BIC rank 


true rank 


1 


2 


3 


4 




true rank 


1 


2 


3 


4 


1 


82 


9 


9 







1 


99 





1 





2 





74 


26 







2 


7 


90 


3 





3 





1 


80 


19 




3 





5 


95 






Table 1. AIC and BIC performance for 100 datasets generated by 3 randomly chosen states of 
ranks 1, 2 and 3. The tables shows the number of times AIC and BIC choose rank 1, 2 or 3 for 
each state. 

that it agrees very well with the predictions of asymptotic theory. For illustration, we consider 
the state of rank r — 2 denoted p, and show that the distributions of BIC{2>) - BIC{2) 
and BIC(1) — BIC (2) concentrate on the positive axis, so that BIC chooses the correct 
rank. Since their behaviours are determined by different mechanisms, we will study each BIC 
difference separately. A similar analysis can be performed for AIC. 

In the first case, 

BIC(3) - BIC(2) = - 2(£ §3 - £ §2 ) + log(n • 3 4 )(p(4, 3) - p(4, 2)) 

= -2{t h -t h ) + 242.98 (21) 
so the problem is to show that the log-likelihood ratio statistic 

A:=2(£ §3 -£ §2 ). 

is typically smaller than the penalty 242.98. For "regular" nested models, the asymptotic 
distribution of A is described by Wilks' theorem as discussed in section Appendix A. 
However, this is not directly applicable here since the rank 2 model lies on the boundary 
of the rank 3 one, due to the positivity constraints. Nevertheless, Wilks' theorem can be 
extended to more general situations where the two hypotheses can be "linearised" locally (see 
[62] chapter 16), in which case the limiting distribution depends on the local geometry of the 
two models and the Fisher information at each point. We will not pursue this analysis here 
but limit ourselves to giving a stochastic upper bound to the limiting distribution which will 
be sufficient for our purposes. The idea is to note that 

A: = 2 (^3-^)< 2 (^-^ 2 ) ( 22 ) 

where £g is the "unconstrained" maximum likelihood estimator obtained by maximising over 

the space V(d 7 r) D T>(d,r) consisting of matrices p of rank r which are not necessarily 
positive but must respect the property that P p (-|d) is a probability distribution for each d. 
The unconstrained MLE is easier to analyse theoretically and can be used to explain why 
MLE often produces rank deficient states when the true state has high purity [12]. Now, 
assuming that we are in the generic situation where all probabilities for the true rank-two state 
p are non-zero, this means that locally around p the rank two model is a regular submodel of 
the extended rank 3 model, and we can apply Wilks' theorem to conclude that 

2 (%-^ 2 )^X 2 (p(4,3)-p(4,2)). 

From (22) we get that A is stochastically bounded from above by x 2 (27) and similarly 
BIC (3) - BIC{2) is bounded from below by 242.98 - X 2 (27) which agrees with the 
simulations results illustrated in the left panel of Figure 3. Note that as n increases, the 
probability of BIC choosing the rank 3 model converges to zero due to the presence of the 
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log n factor in the penalty, while AIC is not rank consistent in the sense that it choses the 
higher rank with a probability which does not vanish with n, in agreement with the results 
illustrated in Table 1. Let us consider now the second difference BIC (I) — BIC(2), and 



BIC(3) - BIC(2) 



BIC(1)-BIC(2) 
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Figure 3. Histogram of BIC differences for the rank 2 state. Left panel BIC(3) - BIC (2); 
right panel BIC(l) — BIC(2). The values are in good agreement with the asymptotic 
predictions. 

note that its values are much larger, a illustrated in the right panel of Figure 3. It turns out 
that in this case the behaviour is not dominated by the complexity penalty but by the bias of 
the lower rank model with respect to the "correct" one, and in particular the distribution of 
the difference is state dependent. The key is to observe that while the rank 2 ML estimator p 2 
converges to the true state p, the rank one ML estimator pi converges to the state pi whose 
corresponding distribution P p j is the closest to the true distribution P p with respect to the 
relative entropy (or Kullback-Leibler divergence) 

p\ := argmin K(¥ p \P T ). 

t£X>(2 4 ,1) 

In conjunction with the law of large numbers we then obtain the almost sure convergence 

A = I^ 2 _^J (P„|P p .) asn^oo. 

For our particular example we used one of the rank one MLE's to compute an approximate 
value K (P T | P p ) w 11.33 which gives an estimate 

BIC{1) - BIC(2) = - 2(£ @i -£ @2 ) + log(n • 3 4 )(p(4, 1) - p(4, 2)) 

w 2 • 11.33 • 100 + log(100 • 3 4 )(p(4, 1) - p(4, 2)) 

w 2266 - 261 = 2005, 

in agreement with the histogram illustrated in the right panel of Figure 3. 

In conclusion, for low rank states with eigenvalues which are not very close to zero, the BIC 
and to lesser extent AIC, identify the correct rank with high probability, the latter having a 
tendency to overfit the true model. On the other hand, as we will see in the next section, the 
BIC may underfit the true model when one or more eigenvalues are small. 
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6. Study 2: one ion simulations 

We have seen that the performance of the model selection criteria depends on the spectrum of 
eigenvalues of the true state, and on the number of measurement repetitions. To investigate 
this dependence we performed a statistical experiment with three one-ion states (k=l) of 
different degrees of purity: a pure state, one with eigenvalues (0.95,0.05), and the other 
with eigenvalues (0.72,0.28). For each state we simulated datasets with varying number 
of repetitions n — 10, 50, 100, 250, 500. Table 2 shows the number of times (out of 1000 
samples) BIC and AIC choose the correct rank of the state, for all possible choices of states 
and measurement repetitions. As expected, in the case of the the pure and the mixed states 
both criteria require a small number of repetitions (of the order of 50) to give the correct 
answer. In the case of the almost pure state, we see a clear dependence with n: for small n 
the difference between the log-likelihoods does not off-set the complexity penalty and both 
criteria choose rank one; at n = 500 the balance tips in favour of the rank 2 model, with AIC 
switching faster than BIC, on average. 





Measurement Repetitions 


10 


50 


100 


250 


500 


State 1 


BIC 


987 


990 


994 


992 


996 


AIC 


945 


944 


919 


927 


930 


State 2 


BIC 


25 


83 


183 


394 


706 


AIC 


77 


312 


502 


802 


942 


State 3 


BIC 


384 


973 


998 


997 


988 


AIC 


594 


992 


998 


997 


988 



Table 2. Performance of BIC and AIC model selection for 3 states: pure (state 1), almost pure 
(state 2), and mixed (state 3). For each choice of number of repetitions, we record the number 
of times the BIC and AIC select the correct rank out of a total of 1000 simulations. 

Figure 4 shows the mean square errors (MSE) of the two MLE's 

MSE(r):=E(\\p r -p\\ 2 2 ), r = l,2 

as a function of n for each of the three states, with the pure state (rank one) estimator in black 
and the mixed state (rank two) estimator in red. For the pure state (left panel), the rank two 
estimator has a larger MSE due to the variance contribution from the third parameter, but 
the relative difference between the two MSE's is small for all n. In this case the rank one 
estimator proposed by both criteria is optimal both from the point of view of parsimony, as 
well as estimation error. For the mixed state (right panel), the rank one estimator has a large 
bias which dominates the MSE, while the rank two MSE decreases at rate 1 jn, as expected. At 
n = 50 the relative difference in risk is significant and both criteria choose the optimal rank- 
two estimator. The most interesting case is that of the almost pure state (middle panel). Here 
we see that the relative difference in MSE is not significant for small and medium number of 
repetitions (n = 10, 50, 100), but for larger n the error of the pure state estimator is dominated 
by its bias while the variance of the full state estimator becomes very small. This behaviour 
is picked up by the model selection criteria, which on average switch to the more complex 
model when n is in the interval between 200 and 500. 

In conclusion, the study shows that both methods become more sensitive to the true rank of 
the state as the number of repetitions increases, and the switching point increases (on average) 
with the purity of the state. As for the estimation error, the switch to the higher rank model 
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Figure 4. Mean square error for rank 1 (black circle) and rank 2 (red circle) estimators, as 
function of the number of measurement repetitions n = 10, 50, 100, 250, 500. Left: state 1 
(pure); Middle: state 2 (almost pure); Right: state 3 (mixed). 



appears to happen in the region where the MSE's of the two estimators starts to diverge from 
each other, which shows that even if the result is suboptimal for small n, the relative difference 
in errors remains small. Finally, the BIC is more aggressive in selecting the lower complexity 
model, due to the additional log-factor in the penalty. 



7. Study 3: model selection for 4 ions real data 

In the third study, we applied the model selection methods to experimental data provided by 
Rainer Blatt's group from the University of Innsbruck. The aim of the experiment [11] was 
to create a particular 4 ions bound entangled state of rank 4 called Smolin state [63], and 
the measurement dataset consisted of counts for the 3 4 measurement settings, with a number 
n = 4800 of repetitions for each setting. We computed the maximum likelihood estimators 
p r for all ranks r between 1 and 16, and found that the corresponding log-likelihoods reach a 
plateau at rank 10 (see Figure 5) which indicates that the rank 10 model is already sufficiently 
rich to describe the measurement data. Reinforcing this conclusion, we found that the value of 
the maximum likelihood for rank 10 (and the subsequent ones) was slightly larger than that of 
that of the maximum likelihood over all states computed with Hradil's iterative method [56], 
probably due to the fact that the latter had not reached the true maximum after 1000 iterations. 

The values of the AIC and BIC for all ranks are shown on the left side of Table 3. The 
two criteria reach minima at ranks r = 6 and respectively r = 9, as a result of the trade-off 
between the increasing penalty and the log-likelihood. As expected, the BIC chooses a smaller 
rank due to its larger complexity penalty, but both methods capture the top 4 eigenvalues of 
order 10 _1 and a few of the following ones of order 10~ 2 which account for experimental 
imprecision in creating the state. On the right side of Table 3 we listed for comparison the 
eigenvalues of the rank 10 and full rank estimators, showing perfect agreement in the first 
two decimal places. We emphasise that results should be taken as an indication that the 
experimental data is consistent with models whose rank could be chosen somewhere between 
6 and 10, rather than answering the ill posed question "what is the rank of the state". To make 
a more informed decision on the final choice of model, one can additionally use different 
model testing procedures such as the Pearson chi-square test discussed below. As we will see, 
the various arguments converge towards the conclusion that the rank 6 estimator may be too 
conservative, while the rank 10 model already fits the data very well. 
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Figure 5. Log-likelihood values for the maximum likelihood estimator as a function of rank. 



RANK 


AIC 


BIC 


EIGENVALUES 
MLE RANK 10 


EIGENVALUES 
MLE RANK 16 


1 


2397395 


2397722 


2.337 e-01 


2.332 e-01 


2 


2217096 


2217738 


2.290 e-01 


2.277 e-01 


3 


2170638 


2171573 


2.258 e-01 


2.253 e-01 


4 


2146295 


2147502 


1.725 e-01 


1.721 e-01 


5 


2145157 


2146614 


4.599 e-02 


4.487 e-02 


6 


2144830 


2146515 


2.656 e-02 


2.445 e-02 


7 


2144719 


2146611 


2.385 e-02 


2.229 e-02 


8 


2144652 


2146728 


1.948 e-02 


1.884 e-02 


9 


2144641 


2146880 


1.226 e-02 


1.155 e-02 


10 


2144648 


2147028 


1.067 e-02 


1.001 e-02 


11 


2144664 


2147164 





6.057 e-03 


12 


2144680 


2147279 





2.751 e-03 


13 


2144694 


2147369 





6.779 e-04 


14 


2144704 


2147433 





5.278 e-06 


15 


2144710 


2147472 





2.153 e-06 


16 


2144712 


2147484 





1.702 e-16 



Table 3. Left: values of AIC and BIC for the ML estimators of ranks 1 to 16. The minimum 
values of the two criteria are attained at ranks r = 9 and respectively r = 6. 
Right: eigenvalues of the MLE's of rank 10 and 16 in decreasing order 

7.1. Pearson x 2 -test 

As an additional tool for probing the conclusions of the model selection procedures, we recast 
the problem as that of testing between the hypotheses 

J Hq = "the dataset is generated by a state of rank at most r" 
Hi = "the dataset is generated by a state of rank higher than r" 

A standard approach to such a problem is based on using the Pearson x 2 -statistic. Following 
the general procedure described in appendix Appendix A, we consider the rank r MLE p r 
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with expected number of counts £7(s|d) := nPp r (s|d), and define the Pearson x 2 -statistic 



T(r) 



E 

s.d 



(iV(s|d)-£;(s|d)) 2 
E(s\d) 



(23) 



where iV(s|d) are the counts from the real data. Under the hypothesis H , the Pearson statistic 
has an asymptotic \ 2 distribution with number of degrees of freedom equal to the number of 
free parameters of the dataset minus the number of parameters of the model 

df(r) := 3 4 -(2 4 -l)-p(r,4). 

Therefore one can define the (asymptotically) level a test 



if T < t a 
if T > t a 



accept Hq 
accept Hi 



where the threshhold t a is chosen such that ¥(Y > t a ) = a for a x 2 (df(r)) -distributed 
random variable Y. In practice the \ 2 approximation works well for pure states (r = 1), 
and small rank states which have only a few small eigenvalues. However, if the state has a 
significant number of small eigenvalues, the distribution of T(r) may differ significantly from 
the asymptotic \ 2 distribution. We will not pursue a theoretical analysis here, but instead use 
bootstrap techniques [55] to estimate the distribution of T(r) and then perform the test with 
respect to the bootstrap distribution. The idea of bootstrap is to use the measurement data 
itself to construct a distribution which (under the hypothesis Hq) approximates that of T(r), 
and therefore can be used to define the threshhold t a instead of the \ 2 distribution. 

Parametric Bootstrap 




1000 1100 1200 

Pearson's Chi Square Statistic 



1300 



Figure 6. Pearson \ 2 statistic T (blue line), the limit \ 2 distribution (red curve) and the 
parametric bootstrap distribution for 100 bootstrap samples. The boostrap distribution is 
shifted with respect to the \ 2 due to the fact that the state is close to the boundary of the 
states space and the asymptotic theory does not hold. Based on the value of T we conclude 
that the hypothesis Hq is not rejected for any reasonable significance level. 



The bootstrap distributions are constructed as follows: 



1) Compute the maximum likelihood estimator p r and its probability distribution P ) g r (s|d); 
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2) Generate a large number N of independent datasets from the distribution ¥p r (s|d) of the 
maximum likelihood estimation ; 

3) Compute the maximum likelihood estimators p^ 00 *, . . . , p^ ot for the bootstrap datasets; 

4) Compute the Pearson \ 2 statistic for each bootstrap sample and its MLE as in (23); 

5) Apply the x 2 test using the empirical distribution of the boostrap x 2 statistics. 

The results of applying the \ 2 test based on the bootstrap distribution to the rank 10 model are 
illustrated in Figure 6. The value of the test is T = 1039 is which means that the hypothesis 
H is accepted. 

8. Conclusions and outlook 

Statistical inference has become a key tool in interpreting the measurement data in quantum 
engineering experiments, which require precise, efficient and informative estimation methods. 
However, standard full state tomography becomes unfeasible for large dimensional quantum 
systems [41]. In this paper we proposed model selection as a general principle for approaching 
state estimation problems. As in [42, 43, 46] the aim is to reduce the dimensionality of 
the problem by taking advantage of the "sparsity" properties of quantum states in realistic 
experimental settings. The route to this goal is however different. The philosophy in model 
selection is to try to find the simplest, or most parsimonious explanation of the data, by 
fitting different models (often of increasing complexity) and choosing the estimator with the 
best trade-off between likelihood and complexity. Concretely, we looked at the problem of 
selecting the rank of the estimator, by using two well known methods: Akaike's information 
criterion (AIC) and the Bayesian information criterion (BIC). In both cases the fit-complexity 
trade-off is realised by penalising the log-likelihood of the data with a measure of complexity 
proportional to the number of parameters of the fixed rank model. We have tested AIC and 
BIC in several real data and simulation studies which we summarise here. 

Pure states. We studied the performance of (rank one) ML for pure states and found a 
very good agreement with the asymptotic predictions based on Fisher information and the 
efficiency of the ML estimator. More interestingly, we found that the MSE is only slightly 
larger than the MSE of the best possible measurement predicted by quantum version of the 
asymptotic theory. In particular this rules out the possibility of significantly improving the 
MSE by means of adaptive measurement design techniques. The (asymptotic) MSE of the 
full counts dataset was compared to that of the "coarse grained" data obtained by estimating 
the means of the Pauli products corresponding to each measurement setting, as used in 
compressed sensing algrithms [42, 43]. For 4 ions, the latter is an order of magnitude larger 
than the former due to the loss of information when discarding the full counts statistics. 

Study 1. For 4 ions states of ranks between 1 and 3 we found that both AIC and BIC identify 
the correct rank in 80%-90% of the cases, when the smallest non-zero eigenvalue is not too 
close to zero. The results are explained by using the ML asymptotic theory. 

Study 2. We analysed the performance of AIC and BIC as a function of the number of 
measurement repetitions and the purity of the state, for a toy example consisting of one ion 
states. With only a small number of repetitions, both methods identify the correct rank for 
pure and "pretty mixed" states. For an "almost pure" state, the model choice switches to rank 
2 as the number of repetitions increases. The switching happens roughly at the point where 
the MSE of the "wrong" rank 1 estimator becomes significantly larger that that of the correct 
model, indicating that model selection is only slightly suboptimal in terms of the MSE. 
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Study 3. We applied model selection to the 4 ions experimental data provided by Rainer Blatt's 
group from the University of Innsbruck. The target state of the experiment was an equal 
mixture of 4 orthogonal pure states, and BIC and AIC selected rank 6 and respectively 9, with 
both estimators capturing the principle eigenvalues and (some of) the noisy components due 
to imperfections in the preparation procedure, of the order 1CP 2 . While the BIC prediction 
seems too conservative, we find that a rank 10 estimator gives a very good explanation 
of the data from several perspectives: log-likelihood values, eigenvalues of the estimators, 
hypothesis testing. 

Overall, the numerical results indicate that model selection gives sensible answers, and can 
be used as an alternative to full tomography and compressed sensing. In principle the method 
works for any state, but is designed to take advantage of the lower complexity of small rank 
states. The drawback is the computational complexity of finding the MLE over states of 
fixed rank. Therefore it would be interesting to see whether ideas from the different methods 
can be combined in a fast, scalable and statistically efficient estimator. A possible future 
direction is apply model selection to state estimation for other types of models such as classes 
of matrix product states, and to system identification problems. Another topic of interest is 
the computation of confidence intervals (error-bars). Last but not least, there is a need for a 
deeper theoretical understanding of the quantum tomography statistical model. We mention 
two important questions: how does the state's proximity to the boundary affect the standard 
asymptotic theory, and how does the model behave for a large number of ions? This would 
hopefully lead to improved estimation algorithms and information criteria for model selection. 
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Appendix A. Pearson x 2 -statistic and Wilks' Theorem 

For reader's convenince we collect here two important results used in the paper. We refer to 
[62, 55] for more details. 

Theorem Appendix A.l (Pearson's x 2 statistic). Let X\, . . . X n be i.i.d. samples from the 
discrete distribution Fg over {1, . . . ,p}, where V := {Fg : 9 G O C K m } is a sufficiently 
regular model with <d an open set. Let N(i) be the number of counts of the outcome i in the 
sample, and let E(i) = nPg(i) be the expected counts. Then, the Pearson \ 2 statistic 



converges in law as n — > oo to the \ 2 distribution with m degrees of freedom. 

Theorem Appendix A.2 (Wilks' Theorem). Let V := {P e : 6 e 6 = R m } be a sufficiently 
regular model and let Vo is the submodel with parameter space 6o := {f £ 6 : f i = ■ • ■ = 
Ok = 0} for some k < m. Let X := {X\, . . . , X n } be i.i.d. samples from Fg and let A be the 



T: =E 



(JV(i)--g(i)) 2 
25(i) 
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log-likelihood ratio statistic 



A := 2 



sup MX)- sup fy(X)) 
e'ee e' ee 



If 9 € 0o, f/zen A converges in law as n — > oo to the x 2 distribution with k degrees of freedom. 

In both cases, it is essential that the parameter does not lie on the boundary, in order to be able 
to apply the asymptotic normality theory of the MLE. This condition is violated for states 
whose rank is strictly smaller than that of the fixed rank model in which they are considered. 
Therefore care must be taken before applying these results directly, and indeed our results 
show that the \ 2 asymptotics fail in some cases. A more refined asymptotic analysis taking 
into account the boundary effects will be pursued elsewhere. 
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